Why do These Match? Explaining the Behavior of Image Similarity Models
Explaining a deep learning model can help users understand its behavior and
allow researchers to discern its shortcomings. Recent work has primarily
focused on explaining models for tasks like image classification or visual
question answering. In this paper, we introduce Salient Attributes for Network
Explanation (SANE) to explain image similarity models, where a model's output
is a score measuring the similarity of two inputs rather than a classification
score. In this task, an explanation depends on both of the input images, so
standard methods do not apply. Our SANE explanations pair a saliency map
identifying important image regions with an attribute that best explains the
match. We find that our explanations provide additional information not
typically captured by saliency maps alone, and can also improve performance on
the classic task of attribute recognition. Our approach's ability to generalize
is demonstrated on two datasets from diverse domains, Polyvore Outfits and
Animals with Attributes 2. Code available at:
https://github.com/VisionLearningGroup/SANE
Comment: Accepted at ECCV 2020
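To make the idea concrete, the following is a minimal sketch (not the authors' released code) of how a SANE-style explanation could pair a saliency map with a best-matching attribute: candidate attribute activation maps are ranked by how strongly they overlap the salient regions. The attribute names and the cosine-overlap scoring rule are illustrative assumptions.

```python
import numpy as np

def rank_attributes(saliency_map, attribute_maps):
    """Rank attribute names by how well their activation maps overlap the saliency map.

    saliency_map:   (H, W) array of non-negative importance scores for the match.
    attribute_maps: dict mapping attribute name -> (H, W) activation map.
    """
    s = saliency_map.ravel()
    s = s / (np.linalg.norm(s) + 1e-8)
    scores = {}
    for name, amap in attribute_maps.items():
        a = amap.ravel()
        a = a / (np.linalg.norm(a) + 1e-8)
        scores[name] = float(s @ a)  # cosine similarity between the two maps
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: the attribute whose activations concentrate on the same region as the
# saliency map is ranked first and would be surfaced alongside the map as the explanation.
saliency = np.zeros((8, 8)); saliency[2:5, 2:5] = 1.0
maps = {"striped": np.pad(np.ones((3, 3)), ((2, 3), (2, 3))),
        "floral":  np.pad(np.ones((3, 3)), ((5, 0), (5, 0)))}
print(rank_attributes(saliency, maps))  # ['striped', 'floral']
```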
Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark
We provide a new multi-task benchmark for evaluating text-to-image models. We
perform a human evaluation comparing the most common open-source (Stable
Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI
graduate students evaluated the two models on three tasks, at three difficulty
levels, across ten prompts each, providing 3,600 ratings. Text-to-image
generation has seen rapid progress to the point that many recent models have
demonstrated their ability to create realistic high-resolution images for
various prompts. However, current text-to-image methods and the broader body of
research in vision-language understanding still struggle with intricate text
prompts that contain many objects with multiple attributes and relationships.
We introduce a new text-to-image benchmark that contains a suite of thirty-two
tasks over multiple applications that capture a model's ability to handle
different features of a text prompt. For example, one task asks a model to generate a
varying number of the same object to measure its ability to count, while another provides
a text prompt with several objects that each have a different attribute to test its
ability to match objects and attributes correctly. Rather than
subjectively evaluating text-to-image results on a set of prompts, our new
multi-task benchmark consists of challenge tasks at three difficulty levels
(easy, medium, and hard) and human ratings for each generated image.
Comment: NeurIPS 2022 Workshop on Human Evaluation of Generative Models (HEGM)
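As a quick sanity check on the numbers quoted above (illustrative arithmetic, not code from the paper), the rating count follows directly from the study design: two models, three tasks, three difficulty levels, and ten prompts each yield 180 generated images, each rated by twenty graduate students.

```python
# Rating count implied by the study design described above (illustrative arithmetic only).
models, tasks, levels, prompts, raters = 2, 3, 3, 10, 20
images = models * tasks * levels * prompts   # 180 generated images per rater
ratings = images * raters                    # 3,600 human ratings in total
print(images, ratings)                       # 180 3600
```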
Beyond the Visual Analysis of Deep Model Saliency
Increased explainability in machine learning is traditionally associated with lower performance; e.g., a decision tree is more explainable but less accurate than a deep neural network. We argue that, in fact, increasing the explainability of a deep classifier can improve its generalization. In this chapter, we survey a line of our published work that demonstrates how spatial and spatiotemporal visual explainability can be obtained, and how such explainability can be used to train models that generalize better on unseen in-domain and out-of-domain samples, refine fine-grained classification predictions, better utilize network capacity, and are more robust to network compression.
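As a concrete illustration of how a spatial saliency map of the kind discussed above can be obtained, the sketch below computes a Grad-CAM-style map by weighting a convolutional feature map with the gradients of the top class score. The backbone (torchvision ResNet-18) and layer choice (layer4) are assumptions for the example, not the specific methods surveyed in the chapter.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()   # stand-in classifier with random weights
features = {}

def hook(_, __, output):
    features["maps"] = output                  # keep the last convolutional feature maps

model.layer4.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)    # stand-in input image
logits = model(x)
class_score = logits[0, logits.argmax()]                # score of the predicted class
grads = torch.autograd.grad(class_score, features["maps"])[0]

weights = grads.mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
cam = F.relu((weights * features["maps"]).sum(dim=1))   # weighted sum over channels
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                    mode="bilinear", align_corners=False)
print(cam.shape)                                        # torch.Size([1, 1, 224, 224])
```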